MINIMUM CROSS-ENTROPY PATTERN
CLASSIFICATION AND CLUSTER ANALYSIS

I.  INTRODUCTION  AND  STATEMENT  OF  THE  PROBLEM 

Let $\mathbf{F}$ denote a "feature vector" of measurements $F_i$
($i = 0, 1, \ldots, M$) that are made on a system for which the individual
measurements can be expressed as expected values with respect to some unknown
underlying probability density function $q^*(\mathbf{x})$:

$$\int d\mathbf{x}\, f_i(\mathbf{x})\, q^*(\mathbf{x}) = F_i , \tag{1}$$

where the $f_i$ are known functions and $\mathbf{x}$ is a finite-dimensional
vector. This paper considers the problem of classifying $\mathbf{F}$ by
identifying the vector $\hat{\mathbf{F}}^{(t)}$ that "best represents"
$\mathbf{F}$ according to the "nearest-neighbor rule"

$$D(\mathbf{F}, \hat{\mathbf{F}}^{(t)}) = \min_{s \in A} D(\mathbf{F}, \hat{\mathbf{F}}^{(s)}) , \tag{2}$$

where $D$ is a distortion measure and where
$\{\hat{\mathbf{F}}^{(s)} : s \in A\}$ is a discrete or continuous set of
pre-defined vectors. The vectors $\hat{\mathbf{F}}^{(s)}$ might be called
characteristic feature vectors, codewords, cluster centers, models,
reproductions, etc. An example of such a problem occurs in speech analysis,
where the measurements $F_i$ are estimates of autocorrelation function values,
which can be expressed as expectations with respect to some underlying
distribution [1], [2]. In speech recognition applications, the identity of
the best codeword $\hat{\mathbf{F}}^{(s)}$ could be used to identify the speech
sound or perhaps the speaker; in speech transmission applications, the
identity of the best codeword can be transmitted as part of a narrow-bandwidth
encoding of the speech [2], [3].

Most of the literature on nearest-neighbor classification deals only with
Euclidean or other metric distortion measures [4], [5]. In contrast, we
consider an information-theoretic distortion measure that is not a metric, but
that nevertheless leads to a classification algorithm that is optimal in a
well-defined sense and is also computationally attractive. Furthermore, the
distortion measure results in a simple method of computing cluster centroids.
Our approach exploits properties of the cross-entropy between any two
probability density functions $q$, $p$, defined by

$$H[q,p] = \int d\mathbf{x}\, q(\mathbf{x}) \log\bigl(q(\mathbf{x})/p(\mathbf{x})\bigr) , \tag{3}$$

if the measure induced by $q$ is absolutely continuous with respect to that
induced by $p$, and $H[q,p] = \infty$ otherwise [6], [7]. In particular, our
approach is based on cross-entropy having unique properties as an information
measure [6]-[8] and on cross-entropy minimization having unique properties as
an inference procedure [9]. The approach can be viewed as a refinement of a
general classification method due to Kullback [6, p. 83]. The refinement
exploits special properties of cross-entropy that hold when the probability
densities involved happen to be minimum cross-entropy densities [10], [11].
The approach is a generalization of speech coding by vector quantization
[2], [3].
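As a concrete illustration of definition (3), the following minimal sketch
evaluates its discrete analogue, with the integral replaced by a sum over a
finite set of states. The function name `cross_entropy` and the list-based
representation of the densities are assumptions made for this example only.

```python
import math

def cross_entropy(q, p):
    """Discrete analogue of (3): sum_i q_i * log(q_i / p_i)."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue            # a term with q_i = 0 contributes nothing
        if pi == 0.0:
            return math.inf     # q is not absolutely continuous w.r.t. p
        total += qi * math.log(qi / pi)
    return total

# Illustrative values: the measure is not symmetric in its two arguments.
q = [0.7, 0.2, 0.1]
p = [1 / 3, 1 / 3, 1 / 3]
print(cross_entropy(q, p), cross_entropy(p, q))
```

The two printed values differ, which reflects the fact, used throughout the
paper, that cross-entropy is not a metric.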


Section II reviews relevant properties of cross-entropy and cross-entropy
minimization and Section III presents the minimum cross-entropy solution to
the classification problem. Section IV considers the cluster analysis problem
of choosing appropriate feature vectors $\hat{\mathbf{F}}^{(s)}$. An example
concerning narrow-band speech transmission is discussed in Section V, and a
general discussion follows in Section VI.


II.  THEORETICAL  BACKGROUND 

Suppose you have a prior estimate $p$ of the unknown probability density
$q^*(\mathbf{x})$, you obtain new information about $q^*$ in the form of
expected value constraints (1), and you need to choose a posterior estimate
$q$ that is in some sense the best estimate of $q^*$ given what you know.
Which one should you choose? The principle of minimum cross-entropy provides a
general solution to this inference problem [9]. The principle states that, of
all the distributions that satisfy the constraints, you should choose the
posterior $q$ with the least cross-entropy (3) with respect to the prior $p$.
As a general method of statistical inference, cross-entropy minimization was
first introduced by Kullback [6]. The name cross-entropy is due to Good [12].
Other names include expected weight of evidence [13, p. 72], directed
divergence [6, p. 7], discrimination information [6, p. 37], and the entropy
of one distribution relative to another [7, p. 19]. The principle of maximum
entropy [14], [15] is a special case of cross-entropy minimization under
appropriate conditions [2], [9].

A.  Minimum  Cross-Entropy  Probability  Densities 

Given a positive prior probability density $p$, if there exists a posterior
that minimizes the cross-entropy (3) and satisfies the constraints (1) and

$$\int d\mathbf{x}\, q(\mathbf{x}) = 1 , \tag{4}$$

then it has the form

$$q(\mathbf{x}) = p(\mathbf{x}) \exp\Bigl(-\lambda - \sum_{k=0}^{M} \beta_k f_k(\mathbf{x})\Bigr) , \tag{5}$$

with the possible exception of a set of points on which the constraints imply
that $q$ vanishes [6, p. 38], [10]. In (5), $\lambda$ and the $\beta_k$ are
Lagrangian multipliers whose values are determined by the constraints (1) and
(4). Conversely, if one can find values for $\lambda$ and the $\beta_k$ in (5)
such that the constraints (1) and (4) are satisfied, then the solution exists
and is given by (5) [10]. Conditions for the existence of solutions are
discussed by Csiszár [10]. The cross-entropy at the minimum can be expressed
in terms of the Lagrangian multipliers and the $F_k$ as follows
([6, p. 38], [11]):

$$H[q,p] = -\lambda - \sum_{k=0}^{M} \beta_k F_k . \tag{6}$$


It is necessary to choose $\lambda$ and the $\beta_k$ so that the constraints
are satisfied. In the presence of the constraint (4), one may rewrite the
remaining constraints (1) in the form

$$\int d\mathbf{x}\, \bigl(f_i(\mathbf{x}) - F_i\bigr)\, q(\mathbf{x}) = 0 . \tag{7}$$

Now, if one finds values for the $\beta_k$ such that

$$\int d\mathbf{x}\, \bigl(f_i(\mathbf{x}) - F_i\bigr)\, p(\mathbf{x}) \exp\Bigl(-\sum_{k=0}^{M} \beta_k f_k(\mathbf{x})\Bigr) = 0 \quad (i = 0, \ldots, M) \tag{8}$$

holds, (7) will be satisfied, and (4) can then be satisfied by setting

$$\lambda = \log \int d\mathbf{x}\, p(\mathbf{x}) \exp\Bigl(-\sum_{k=0}^{M} \beta_k f_k(\mathbf{x})\Bigr) . \tag{9}$$

If the integral in (9) can be performed, one can sometimes find values for
the $\beta_k$ from the relations

$$-\frac{\partial \lambda}{\partial \beta_k} = F_k ,$$

which follow by differentiating (9) and using (1). Unfortunately, it is
usually impossible to solve these relations or (8) for the $\beta_k$
explicitly, in order to obtain a closed-form solution expressed directly in
terms of the known expected values $F_k$ rather than in terms of the
Lagrangian multipliers. Computational methods for finding approximate
solutions are, however, available ([11], [16]).
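To make the role of the multipliers concrete, the following sketch solves a
small discrete version of (5)-(9) numerically. It is not one of the methods of
[11] or [16]: it uses plain gradient descent on the convex function
$\lambda(\beta) + \sum_k \beta_k F_k$, whose gradient is $F_k - E_q[f_k]$. The
names `density` and `solve_multipliers`, the six-state example, and the step
size are all assumptions chosen for illustration.

```python
import math

def density(prior, fs, beta):
    """Return (q, lam) with q_i = p_i * exp(-lam - sum_k beta_k * f_k(i)); see (5), (9)."""
    w = [p * math.exp(-sum(b * f[i] for b, f in zip(beta, fs)))
         for i, p in enumerate(prior)]
    z = sum(w)
    return [wi / z for wi in w], math.log(z)

def solve_multipliers(prior, fs, F, lr=0.1, steps=5000):
    """Gradient descent on lam(beta) + sum_k beta_k * F_k (a convex function of beta)."""
    beta = [0.0] * len(F)
    for _ in range(steps):
        q, _ = density(prior, fs, beta)
        # Gradient component k is F_k - E_q[f_k]; it vanishes when (1) holds.
        grad = [Fk - sum(qi * fi for qi, fi in zip(q, f)) for Fk, f in zip(F, fs)]
        beta = [b - lr * g for b, g in zip(beta, grad)]
    q, lam = density(prior, fs, beta)
    return q, lam, beta

# Hypothetical example: six states, uniform prior, one constraint E[x] = 1.5.
states = [0, 1, 2, 3, 4, 5]
prior = [1 / 6] * 6
fs = [[float(x) for x in states]]
F = [1.5]

q, lam, beta = solve_multipliers(prior, fs, F)
mean = sum(qi * x for qi, x in zip(q, states))
H_direct = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior))
H_from_multipliers = -lam - beta[0] * F[0]      # right-hand side of (6)
print(mean, H_direct, H_from_multipliers)
```

In this toy case the directly computed cross-entropy agrees with the
expression (6) once the constraint is met. The fixed step size is adequate
here because the constraint function is bounded; a general-purpose solver
would need a more careful scheme.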


B. Justification and Properties of Cross-Entropy Minimization

In this section, we discuss justifications for the principle of minimum
cross-entropy, and we summarize three important properties of cross-entropy
minimization that lead to the classification method described in Section III.
For general statements and proofs of these and other properties, see [11].

In what sense does cross-entropy minimization yield the best estimate of
$q^*$? To answer this question, it is useful and convenient to view
cross-entropy minimization as one implementation of an abstract information
operator $\circ$ that takes two arguments, a prior and new information, and
yields a posterior. Thus, we write the posterior $q$ as $q = p \circ I$, where
$I$ stands for the known constraints (1) on expected values plus the usual
normalization constraint (4). Recent work has shown that, if the operator
$\circ$ is required to satisfy certain axioms of consistent inference, and if
$\circ$ is implemented by means of functional minimization, then the principle
of minimum cross-entropy follows necessarily [9]. Informally, the axioms of
$\circ$ may be phrased as follows:

1)  Uniqueness.  The  results  of  taking  new  information  into  account  should 
be  unique. 

2)  Invariance.  It  shouldn't  matter  in  which  coordinate  system  one  accounts 
for  new  information. 

3)  System  independence.  It  shouldn't  matter  whether  one  accounts  for
independent  information  about  independent  systems  separately  in  terms 
of  different  probability  densities  or  together  in  terms  of  a  joint 
density. 

4)  Subset  Independence.  It  shouldn't  matter  whether  one  accounts  for 
information  about  an  independent  subset  of  system  states  in  terms  of  a 
separate  conditional  density  or  in  terms  of  the  full  system  density. 

For the formal statements, see [9]. In terms of these axioms, the principle
of cross-entropy minimization is correct in the following sense: Given a
prior probability density and new information in the form of constraints on
expected values, there is only one posterior density satisfying these
constraints that can be chosen in a manner that satisfies the axioms; this
unique posterior can be obtained by minimizing cross-entropy.

An additional interpretation of the sense in which $q = p \circ I$ is the best
estimate of $q^*$ rests on cross-entropy's well-known [6] and unique [8]
properties as an information measure. For example, cross-entropy satisfies

$$H[q,p] \ge 0 , \tag{10}$$

with equality only if $p = q$ almost everywhere. Also, if the space on which
$p$ and $q$ are defined is the product of two sample spaces $X_1$ and $X_2$,
and if $p$ and $q$ have the product form

$$p(x_1,x_2) = p_1(x_1)\,p_2(x_2)$$

and

$$q(x_1,x_2) = q_1(x_1)\,q_2(x_2) ,$$

then

$$H[q,p] = H[q_1 q_2, p_1 p_2] = H[q_1,p_1] + H[q_2,p_2]$$

holds. Informally speaking, $H[q,p]$ is a measure of the "information
divergence" or "information dissimilarity" between $q$ and $p$. In these
terms, one can interpret the principle of minimum cross-entropy as follows:
Since $q = p \circ I$ minimizes $H[q,p]$, the posterior hypothesis for $q^*$
is as close as possible in an information-measure sense to the prior
hypothesis while at the same time satisfying the new constraints $I$. Owing to
cross-entropy's properties as an information measure, $H[q,p]$ has been
proposed as a measure of the distortion introduced if $p$ is used instead of
$q$ [17], even though $H$ does not have the properties of a metric. (For
example, it does not satisfy a general triangle inequality.) In the context of
cross-entropy minimization, however, there is a much stronger justification
for using cross-entropy as a distortion measure. In particular, the following
property holds (see [10], [11]):


Property A (triangle equality). Let $I$ be the constraints (1) and let $p$ be
any prior. Then

$$H[q^*,p] = H[q^*, p \circ I] + H[p \circ I, p] . \tag{11}$$

Thus, the minimum cross-entropy posterior estimate of $q^*$ is not only
logically consistent, but also closer to $q^*$, in the cross-entropy sense,
than is the prior $p$. Moreover, the difference
$H[q^*,p] - H[q^*, p \circ I]$ is exactly the cross-entropy
$H[p \circ I, p]$ between the posterior and the prior. Hence,
$H[p \circ I, p]$ can be interpreted as the amount of information provided by
the constraints $I$ that is not inherent in $p$. Stated differently,
$H[p \circ I, p]$ is the amount of additional distortion introduced if $p$ is
used instead of $p \circ I$. Since, for any density $r$, there exist
constraints $I_r$ such that $r = p \circ I_r$ for any prior $p$, $H[r,p]$ is
in general the amount of information needed to determine $r$ when given $p$,
or the amount of additional distortion introduced if $p$ is used instead of
$r$ [11].
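
The triangle equality can be checked directly from the exponential form of the
posterior. The short derivation below is a sketch under three assumptions:
that $q^*$ satisfies the constraints (1), that $q = p \circ I$ has the form (5)
with multipliers $\lambda$ and $\beta_k$, and that the exceptional set on which
$q$ vanishes can be ignored.

```latex
\begin{aligned}
H[q^*,p] - H[q^*,q]
  &= \int d\mathbf{x}\, q^*(\mathbf{x}) \log\frac{q(\mathbf{x})}{p(\mathbf{x})} \\
  &= \int d\mathbf{x}\, q^*(\mathbf{x})
     \Bigl(-\lambda - \sum_{k=0}^{M} \beta_k f_k(\mathbf{x})\Bigr) \\
  &= -\lambda - \sum_{k=0}^{M} \beta_k F_k \\
  &= H[q,p] .
\end{aligned}
```

The last step uses the expression (6) for the cross-entropy at the minimum, so
that $H[q^*,p] = H[q^*,q] + H[q,p]$, which is (11).
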
Additional justification for using cross-entropy as a distortion measure
in the context of cross-entropy minimization is provided by the following
property:

Property B (expected value matching). Let $I(\mathbf{F})$ be the constraints
(1) for a fixed set of functions $f_k$ and let $q = p \circ I$ be the result
of taking this information into account. Then, for an arbitrary fixed density
$q^*$, the cross-entropy $H[q^*,q] = H[q^*, p \circ I]$ has its minimum value
when the constraints (1) satisfy

$$F_k = F_k^* = \int d\mathbf{x}\, q^*(\mathbf{x})\, f_k(\mathbf{x}) .$$

This is a generalization of a property of orthogonal polynomials [18, p. 12]
that, in the case of speech analysis, is called the "correlation matching
property" [19, Ch. 2]. Property B states that, for a density $q$ of the
general form (5), $H[q^*,q]$ is smallest when the expectations of $q$ match
those of $q^*$.

In particular, it follows that $q = p \circ I$ is not only the density that
minimizes $H[q,p]$, as already discussed, but also is the density of the form
(5) that minimizes $H[q^*,q]$. Hence $p \circ I$ is not only closer to $q^*$
than is $p$, as shown by the convenient form (11), but it is the closest
possible density of the form (5).

Another property of cross-entropy minimization that we shall need in
Section III is the following:

Property C. Let $I_1$ and $I_2$ stand respectively for the constraints

$$\int d\mathbf{x}\, f_i(\mathbf{x})\, q_1^*(\mathbf{x}) = F_i^{(1)}$$

and

$$\int d\mathbf{x}\, f_i(\mathbf{x})\, q_2^*(\mathbf{x}) = F_i^{(2)} ,$$

which involve the same set of functions $f_i$, $i = 0, \ldots, M$. Then

$$(p \circ I_1) \circ I_2 = p \circ I_2 \tag{12}$$

and

$$H[q_2,p] = H[q_2,q_1] + H[q_1,p] + \sum_{r=0}^{M} \beta_r^{(1)} \bigl(F_r^{(1)} - F_r^{(2)}\bigr) \tag{13}$$

hold, where $q_1 = p \circ I_1$, $q_2 = p \circ I_2$, and the
$\beta_r^{(1)}$ are the Lagrangian multipliers associated with
$q_1 = p \circ I_1$.

Suppose that $q_1^*$ and $q_2^*$ are the system probability densities at two
different times, and suppose that $q_1^*$ or estimates of $q_1^*$ are
considered to be reasonable prior estimates of $q_2^*$. That is,
$p \circ I_1$ is considered to be a reasonable prior estimate of $q_2^*$.
Property C states that, when $I_2$ is determined by expectations of the same
functions as $I_1$, the results of taking $I_1$ into account are completely
wiped out by subsequently taking $I_2$ into account.
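
The identity (13) can be verified in a few lines. The sketch below assumes
that $q_1 = p \circ I_1$ has the form (5) with multipliers $\lambda^{(1)}$ and
$\beta_r^{(1)}$, and that $q_2 = p \circ I_2$ reproduces the expected values
$F_r^{(2)}$; the symbol $\lambda^{(1)}$ for the normalization multiplier of
$q_1$ is notation introduced here for the illustration.

```latex
\begin{aligned}
H[q_2,q_1]
  &= \int d\mathbf{x}\, q_2 \log\frac{q_2}{p}
   - \int d\mathbf{x}\, q_2 \log\frac{q_1}{p} \\
  &= H[q_2,p] + \lambda^{(1)} + \sum_{r=0}^{M} \beta_r^{(1)} F_r^{(2)} , \\
H[q_1,p]
  &= -\lambda^{(1)} - \sum_{r=0}^{M} \beta_r^{(1)} F_r^{(1)} .
\end{aligned}
```

Adding the two lines cancels $\lambda^{(1)}$ and gives
$H[q_2,q_1] + H[q_1,p] = H[q_2,p] - \sum_r \beta_r^{(1)} (F_r^{(1)} - F_r^{(2)})$,
which rearranges to (13).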


III.  CLASSIFICATION  METHOD 


We now consider the problem outlined in Section I. Let $I$ denote the
constraints (1) associated with the feature vector $\mathbf{F}$, and let
$I_s$ denote analogous constraints associated with each of the pre-defined
codewords $\hat{\mathbf{F}}^{(s)}$,

$$\int d\mathbf{x}\, f_i(\mathbf{x})\, q^*(\mathbf{x}) = \hat{F}_i^{(s)} . \tag{14}$$

Suppose that $p$ is an estimate of $q^*$ that is available prior to learning
$\mathbf{F}$. Then the best posterior estimate of $q^*$ is

$$q = p \circ I \tag{15}$$

in the sense discussed in Section II. Now, let $q_s = p \circ I_s$ be the
minimum cross-entropy estimates of $q^*$ that would apply if the current
feature vector $\mathbf{F}$ were equal to the codeword
$\hat{\mathbf{F}}^{(s)}$. As discussed in Section II, $H[q,q_s]$ is the amount
of information-theoretic distortion introduced if $q$ is represented by
$q_s$. It is therefore reasonable to define the distortion measure between
$\mathbf{F}$ and $\hat{\mathbf{F}}^{(s)}$ as

$$D(\mathbf{F}, \hat{\mathbf{F}}^{(s)}) = H[q,q_s] , \tag{16}$$

where $q = p \circ I$ and $q_s = p \circ I_s$. The nearest neighbor
classification rule (2) then becomes: find $t$ such that

$$H[q,q_t] = \min_{s \in A} H[q,q_s] . \tag{17}$$

Since $I$ and $I_s$ involve the same functions $f_i$, Property C gives

$$q = p \circ I = (p \circ I_s) \circ I = q_s \circ I , \tag{19}$$

and Property A, applied with $q_s$ as the prior, then yields

$$H[q^*,q_s] = H[q^*,q] + H[q,q_s] . \tag{20}$$

$H[q^*,q_s]$ is the amount of information needed to determine $q^*$ given the
codeword $q_s$, or the amount of distortion introduced when $q^*$ is
represented by $q_s$. Equation (20) states that the total distortion
$H[q^*,q_s]$ is the sum of the distortion introduced when $q^*$ is represented
by the best posterior estimate $q = p \circ I$ plus the distortion introduced
when $q$ is represented by the codeword $q_s$. Since $q$ minimizes
$H[q^*,q]$ in the sense defined in Property B, (20) shows that the
classification rule (17) is optimal in the sense of minimizing the total
distortion $H[q^*,q_s]$. The rule (17) is equivalent to the minimum
discrimination classification method of Kullback [6, p. 83] since
$q = q_s \circ I$ by (19), which shows that the Kullback method is optimal in
a sense that has not been appreciated previously. Notice that when $q$ is in
the codeword set ($q = q_s$ for some $s \in A$) the rule (17) just selects
$q$, the best posterior estimate of $q^*$.

Minimizing $H[q,q_t]$ identifies the associated codeword
$\hat{\mathbf{F}}^{(t)}$. Now, the quantity
$H[q,q_t] = H[q_t \circ I, q_t]$ is the amount of information provided by $I$
that was not already inherent in $q_t$. We can therefore restate the solution
in terms of the problem as originally posed (choosing the codeword
$\hat{\mathbf{F}}^{(t)}$ that "best represents" the feature vector
$\mathbf{F}$) in the following way: Choose the codeword
$\hat{\mathbf{F}}^{(t)}$ such that the feature vector $\mathbf{F}$ provides
the least additional information beyond what $\hat{\mathbf{F}}^{(t)}$
provides.

We now consider the computational requirements of the classification
method. At this point we specialize to the case in which there is a discrete
set of $n$ codewords $\hat{\mathbf{F}}^{(j)}$. Given an input feature vector
$\mathbf{F}$ of $M+1$ measurements $F_k$, the classification procedure may be
summarized as follows:

a) compute $q = p \circ I$, where $I$ represents the constraints (1);

b) compute $H[q,q_j]$ ($j = 1, \ldots, n$), where $q_j = p \circ I_j$ and
$I_j$ represents the constraints (14), and find a value $j$ such that
$H[q,q_j] \le H[q,q_i]$ for $i \ne j$.

Now, owing to Property C, it turns out that the first step is unnecessary:
this "two-step" procedure reduces to a single step. From (13), it follows that

$$H[q,q_j] = H[q,p] - H[q_j,p] - \sum_{k=0}^{M} \beta_k^{(j)} \bigl(\hat{F}_k^{(j)} - F_k\bigr)$$

or

$$H[q,q_j] = H[q,p] + \lambda^{(j)} + \sum_{k=0}^{M} \beta_k^{(j)} F_k \tag{21}$$

holds, where we have substituted for $H[q_j,p]$ by means of (6). In (21) the
$F_k$ are components of the input feature vector $\mathbf{F}$ and the
$\lambda^{(j)}$ and $\beta_k^{(j)}$ are Lagrangian multipliers associated with
$q_j = p \circ I_j$. Since the $\hat{\mathbf{F}}^{(j)}$ are known ahead of
time, these multipliers can be computed ahead of time [11, Appendix A], [16].
Now the quantity $H[q,p]$ is the same for every codeword, given the feature
vector $\mathbf{F}$, so that the closest codeword $\hat{\mathbf{F}}^{(j)}$ can
be determined by finding the smallest of the quantities

$$\Delta_j = \lambda^{(j)} + \sum_{k=0}^{M} \beta_k^{(j)} F_k , \tag{22}$$

which does not involve having to compute $q = p \circ I$. Computing each
$\Delta_j$ requires $M+1$ multiplications and additions involving $M+2$
pre-stored multipliers and the $M+1$ elements of the input vector
$\mathbf{F}$. If there are $n$ possible codewords, the total requirement is
$n(M+1)$ multiplications and additions, storage for $n(M+2)$ Lagrangian
multipliers, and approximately $n$ comparisons (to find the smallest
$\Delta_j$). One can also trade about $n(M+1)/2$ of the multiplications for
additions [20]. Since the $\Delta_j$ can be computed independently, concurrent
computation is possible.
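
The single-step rule can be sketched in a few lines of code. In the example
below, `classify` evaluates the quantity in (22) for each codeword from
pre-stored multipliers and returns the index of the smallest. To obtain
multipliers in closed form, the example assumes a standard Gaussian prior with
constraints on the mean only, for which the multipliers of a codeword with
mean $\mathbf{m}$ are $\beta = -\mathbf{m}$ and
$\lambda = |\mathbf{m}|^2/2$; this special case, the function names, and the
numbers are illustrative assumptions, not taken from the paper.

```python
def classify(F, codebook):
    """Return (index, deltas), where deltas[j] = lam_j + sum_k beta_jk * F_k as in (22)."""
    deltas = [lam + sum(b * f for b, f in zip(beta, F)) for lam, beta in codebook]
    best = min(range(len(deltas)), key=deltas.__getitem__)
    return best, deltas

def gaussian_mean_codeword(m):
    """Multipliers for prior N(0, I) and constraints E[x_k] = m_k (illustrative case)."""
    lam = 0.5 * sum(mk * mk for mk in m)
    beta = [-mk for mk in m]
    return lam, beta

# Hypothetical codebook; its multipliers are computed once, ahead of time.
means = [(0.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
codebook = [gaussian_mean_codeword(m) for m in means]

F = (1.4, 0.2)                       # input feature vector
best, deltas = classify(F, codebook)
print(best, deltas)
```

In this special case $H[q,q_j]$ reduces to half the squared Euclidean distance
between $\mathbf{F}$ and the codeword mean, so the selected index can be
checked against an ordinary nearest-mean search. Each evaluation uses only the
stored multipliers and the input vector, as counted in the text.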

These results are a generalization of the method of speech coding by
vector quantization [3], [2], which exploits a special case of (20) that was
found to hold for a speech spectral distortion measure due to Itakura and
Saito [21], [22]. Under suitable assumptions, the Itakura-Saito distortion
measure can be shown to be a special case of asymptotic cross-entropy rate
[2], [22]. In Section V, we show how speech coding by vector quantization
follows as a special case from (22).


IV.  COMPUTATION  OF  CLUSTER  CENTROIDS 


Suppose that a cluster of measurement vectors $\mathbf{F}^{(i)}$,
$i = 1, \ldots, N$, is to be represented in the classification procedure of
Section III by a single "centroid" codeword $\hat{\mathbf{F}}$. For example,
the $\mathbf{F}^{(i)}$ might result from measurements on $N$ members of the
class to be represented by $\hat{\mathbf{F}}$. How should one determine
$\hat{\mathbf{F}}$ from the $\mathbf{F}^{(i)}$?

The selection of centroids is a key facet of cluster analysis techniques
such as the k-means technique [5] or the ISODATA technique [23], and it is
also important in the design of vector quantizers [24]. When the distortion
measure is the Euclidean distance, centroids are simply Euclidean centers of
gravity. For more general distortion measures as in (16), a natural
generalization of the Euclidean centroid [24] of a collection
$\{\mathbf{F}^{(i)} ;\ i = 1, \ldots, N\}$ is the vector $\hat{\mathbf{F}}$
minimizing the average distortion,

$$D_c(\hat{\mathbf{F}}) = \frac{1}{N} \sum_{i=1}^{N} D(\mathbf{F}^{(i)}, \hat{\mathbf{F}}) = \frac{1}{N} \sum_{i=1}^{N} H[q_i, \hat{q}] , \tag{23}$$

where $q_i = p \circ I_i$ and $\hat{q} = p \circ \hat{I}$. Here, $I_i$ and
$\hat{I}$ stand for expected value constraints of the form (1) corresponding
respectively to $\mathbf{F}^{(i)}$ and $\hat{\mathbf{F}}$.

Perhaps surprisingly, the centroid for this apparently complicated
non-Euclidean distortion measure can be readily evaluated because of the
special properties of Section II. In fact, we show below that the minimum of
$D_c(\hat{\mathbf{F}})$ is achieved simply by the components of
$\hat{\mathbf{F}}$ each being the arithmetic mean of the components of the
$\mathbf{F}^{(i)}$.

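Since $I_i$ and $\hat{I}$ involve the same constraint functions $f_k$,
Property C applies. Eq. (13) yields

$$H[q_i,\hat{q}] = H[q_i,p] - H[\hat{q},p] - \sum_{r=0}^{M} \hat{\beta}_r \bigl(\hat{F}_r - F_r^{(i)}\bigr) ,$$

where the $\hat{\beta}_r$ are Lagrangian multipliers associated with
$\hat{q} = p \circ \hat{I}$. It follows that (23) becomes

$$D_c(\hat{\mathbf{F}}) = \frac{1}{N} \sum_{i=1}^{N} H[q_i,p] - H[\hat{q},p] - \sum_{r=0}^{M} \hat{\beta}_r \bigl(\hat{F}_r - \bar{F}_r\bigr) , \tag{24}$$

where the $\bar{F}_r$ are components of the mean constraints

$$\bar{\mathbf{F}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{F}^{(i)} . \tag{25}$$

Now, let $\bar{q} = p \circ \bar{I}$, where $\bar{I}$ represents the mean
constraints $\bar{\mathbf{F}}$. Then (13) yields

$$H[\bar{q},\hat{q}] = H[\bar{q},p] - H[\hat{q},p] - \sum_{r=0}^{M} \hat{\beta}_r \bigl(\hat{F}_r - \bar{F}_r\bigr) .$$

By combining this with (24), we obtain

$$D_c(\hat{\mathbf{F}}) = H[\bar{q},\hat{q}] - H[\bar{q},p] + \frac{1}{N} \sum_{i=1}^{N} H[q_i,p] . \tag{26}$$

Since $D_c(\hat{\mathbf{F}})$ depends on $\hat{\mathbf{F}}$ only through the
first term, minimizing $D_c(\hat{\mathbf{F}})$ is equivalent to minimizing
$H[\bar{q},\hat{q}]$. This minimum occurs when $\hat{q} = \bar{q}$ (see (10)),
which in turn means that the optimal centroid $\hat{\mathbf{F}}$ is
$\hat{\mathbf{F}} = \bar{\mathbf{F}}$, where $\bar{\mathbf{F}}$ is given by
(25). Hence the components of the cluster centroid $\hat{\mathbf{F}}$ are just
the arithmetic means of the components of the cluster elements
$\mathbf{F}^{(i)}$.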
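
A small numerical check may help make the centroid result concrete. The sketch
below assumes a one-dimensional special case chosen because everything is
available in closed form: a standard Gaussian prior and a single constraint on
$E[x^2]$, so that each feature "vector" is a variance, the minimum
cross-entropy densities are zero-mean Gaussians, and the distortion is
$H[N(0,a), N(0,b)] = \tfrac12 (a/b - 1 - \log(a/b))$. The function names, the
example cluster, and the grid search are illustrative assumptions.

```python
import math

def centroid(vectors):
    """Componentwise arithmetic mean of the feature vectors, as in (25)."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def h_var(a, b):
    """Cross-entropy H[N(0, a), N(0, b)] between zero-mean Gaussians with variances a, b."""
    return 0.5 * (a / b - 1.0 - math.log(a / b))

def average_distortion(variances, v_hat):
    """Average distortion (23) when the cluster is represented by variance v_hat."""
    return sum(h_var(v, v_hat) for v in variances) / len(variances)

# Hypothetical cluster of one-component feature vectors (measured variances).
cluster = [[0.5], [1.0], [4.0]]
variances = [v[0] for v in cluster]
c = centroid(cluster)[0]

# Brute-force comparison against a grid of candidate centroids.
grid = [0.05 * i for i in range(1, 200)]
best_on_grid = min(grid, key=lambda v: average_distortion(variances, v))
print(c, best_on_grid, average_distortion(variances, c))
```

Although this distortion is far from Euclidean, the grid search lands next to
the arithmetic mean of the variances, in line with the result derived above.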

V.  EXAMPLE  -  SPEECH  CODING  BY  VECTOR  QUANTIZATION 

Speech coding by vector quantization is a recently developed
narrow-bandwidth speech coding technique based on Linear Prediction Coding
(LPC) [3], [2]. Based on estimates of the sample autocorrelation function that
are measured in each frame, the speech in each frame is coded in terms of the
identity of a prestored set of LPC parameters called a codeword. The LPC
parameters used are the inverse filter gain $\sigma^2$ and sample coefficients
$a_i$, $i = 0, \ldots, M$, with

$$a_0 = 1 . \tag{27}$$

These parameters characterize a filter that is used in synthesizing the speech
after decoding. The nearest-neighbor distortion rule used in coding the
speech selects in each frame the codeword that has the smallest Itakura-Saito
distortion [21], [22] with respect to the current frame of speech. In
particular ([2], [3]), one finds the codeword with parameters that minimize the
expression

$$\sigma^{-2} \Bigl[ r_x(0)\, r_a(0) + 2 \sum_{s=1}^{M} r_x(s)\, r_a(s) \Bigr] + \log(\sigma^2) , \tag{28}$$

where

$$r_a(s) = \begin{cases} \displaystyle\sum_{i=0}^{M-s} a_i\, a_{i+s} , & 0 \le s \le M \\ 0 , & \text{otherwise} \end{cases} \tag{29}$$

and where the $r_x(s)$ are measurements that estimate the autocorrelation
function of the speech in the current frame for lags $s = 0, 1, \ldots, M$.
For convenience, we omit indexing $\sigma$ and the $a_i$ with the codebook
parameter $j$ over which (28) is minimized.
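
The codeword search based on (28) and (29) can be sketched as follows. The
function names, the three-entry first-order codebook, and the measured
autocorrelations are hypothetical values chosen for illustration; each
codebook entry is a pair (coefficients $a_0, \ldots, a_M$ with $a_0 = 1$,
gain $\sigma^2$).

```python
import math

def r_a(a, s):
    """Autocorrelation of the inverse-filter coefficients, as in (29)."""
    M = len(a) - 1
    if s < 0 or s > M:
        return 0.0
    return sum(a[i] * a[i + s] for i in range(M - s + 1))

def distortion(r_x, a, sigma2):
    """Expression (28) for one codeword (a, sigma2) and measured autocorrelations r_x."""
    M = len(a) - 1
    alpha = r_x[0] * r_a(a, 0) + 2.0 * sum(r_x[s] * r_a(a, s) for s in range(1, M + 1))
    return alpha / sigma2 + math.log(sigma2)

def best_codeword(r_x, codebook):
    """Index of the codebook entry that minimizes (28)."""
    return min(range(len(codebook)), key=lambda j: distortion(r_x, *codebook[j]))

# Hypothetical first-order (M = 1) codebook.
codebook = [([1.0, 0.0], 1.0), ([1.0, -0.5], 0.75), ([1.0, 0.5], 0.75)]

# Autocorrelations of a first-order autoregressive process y_n = 0.5 y_{n-1} + sigma e_n
# with sigma^2 = 0.75, normalized so that r_x(0) = 1.
r_x = [1.0, 0.5]

best = best_codeword(r_x, codebook)
print(best, [distortion(r_x, a, s2) for a, s2 in codebook])
```

Here the measured autocorrelations match the second codeword exactly, and that
entry attains the smallest value of (28).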

Gray, et al. [2] have shown that the minimization expression (28) can be
derived by means of cross-entropy minimization. Here, we shall show that (28)
is a special case of the general expression (22). In doing so, we shall use
some results from [2]. In [2], the derivation is conducted in terms of
codebook probability densities $\hat{q}(\mathbf{y})$, where
$\mathbf{y} = (y_0, \ldots, y_{k-1})$ is a vector of $k$ time domain signal
samples. In particular, each codebook entry corresponds to an autoregressive
model of the form

$$y_i = u_i \quad (i = -M, \ldots, -1)$$

$$y_n = -\sum_{j=1}^{M} a_j\, y_{n-j} + \sigma e_n ,$$

where the $u_i$ are the initial conditions for the filter, where the $y_n$
are time-domain signal samples, and where $e_n$ is a zero-mean, unit-variance
sequence of independent Gaussian random variables. The vector $\mathbf{y}$
can be expressed in the form
$\mathbf{y} = \mathbf{A}^{-1}(\sigma \mathbf{e} + \mathbf{v})$, where
$\mathbf{A}$ is a banded, triangular matrix whose components are the inverse
filter sample coefficients,

$$(\mathbf{A})_{ij} = \begin{cases} a_{i-j} , & 0 \le i-j \le M \\ 0 , & \text{otherwise} \end{cases} \tag{30}$$

and where

$$(\mathbf{v})_n = \begin{cases} -\displaystyle\sum_{j=n-M}^{-1} a_{n-j}\, u_j , & n < M \\ 0 , & n \ge M . \end{cases} \tag{31}$$

Each codebook density $\hat{q}(\mathbf{y})$ is Gaussian,

$$\hat{q}(\mathbf{y}) = (2\pi)^{-k/2} (\det \mathbf{R}_2)^{-1/2} \exp\bigl(-(\mathbf{y}-\mathbf{m})^t \mathbf{R}_2^{-1} (\mathbf{y}-\mathbf{m})/2\bigr) , \tag{32}$$

with mean

$$\mathbf{m} = E(\mathbf{y}) = \mathbf{A}^{-1}\mathbf{v} , \tag{33}$$

and covariance

$$\mathbf{R}_2 = E\bigl((\mathbf{y}-E(\mathbf{y}))(\mathbf{y}-E(\mathbf{y}))^t\bigr) = \sigma^2 (\mathbf{A}^t \mathbf{A})^{-1} , \tag{34}$$

where $E$ stands for expected value and where $t$ indicates a transpose
operation. For convenience, we omit the voicing parameter that is part of the
analysis in [2]. Our results, however, are unaffected by this omission.


In [2], the expression (28) is obtained by using constraints

$$E(y_i y_j) = r_x(|i-j|) , \quad |i-j| \le M \tag{35}$$

$$E(\mathbf{y}) = \mathbf{0} \tag{36}$$

for each speech frame, applying Kullback's classification procedure
[6, p. 83] to select a codebook entry $\hat{q}_s$, and taking the
$k \to \infty$ limit of the per-symbol cross-entropy so that the
classification is based on the stable, non-transient behavior of the
autoregressive models.


The codebook densities (32) were derived in [2] directly from arguments
concerning the speech reproduction model class. For our purpose here
(applying the results of Section III) we need to express the codebook
densities as minimum cross-entropy densities $\hat{q} = p \circ \hat{I}$, for
some fixed prior $p$ and codebook-dependent constraints $\hat{I}$. This is
accomplished ([2], [6], [10], [11]) by the prior

$$p(\mathbf{y}) = (2\pi)^{-k/2} \exp\bigl[-\mathbf{y}^t \mathbf{I} \mathbf{y}/2\bigr] , \tag{37}$$

where $\mathbf{I}$ is the identity matrix, and by the constraints

$$E(y_i y_j) = (\mathbf{R}_2 + \mathbf{m}\mathbf{m}^t)_{ij} , \quad |i-j| \le M \tag{38}$$

$$E(\mathbf{y}) = \mathbf{m} . \tag{39}$$

We can now make the connection with the results of Section III. There we
showed that the best codeword is determined by the minimum of the quantities
(22),

$$\Delta = \lambda + \sum_{k} \beta_k F_k , \tag{40}$$

which we have rewritten here without the codebook index $j$. In terms of the
foregoing, the measurements $F_k$ are given by the right hand sides of
(35)-(36), the $\beta_k$ are the Lagrangian multipliers in
$\hat{q} = p \circ \hat{I}$ that correspond to the constraints (38)-(39), and
$\lambda$ is the normalization multiplier. The Lagrangian multipliers can be
identified in (32) by noting that $\hat{q}$ must have the general form (5).
After factoring out the prior (37), it is easy to see that $\lambda$ is given
by

$$\exp[-\lambda] = (\det \mathbf{R}_2)^{-1/2} \exp\bigl[-\mathbf{m}^t \mathbf{R}_2^{-1} \mathbf{m}/2\bigr] \tag{41}$$

and that the terms $y_i y_j$ in the exponential in (32) have the factors
$(1/2)\bigl((\mathbf{R}_2^{-1})_{ij} - \delta_{ij}\bigr)$, which are therefore
the Lagrangian multipliers corresponding to the constraints (38). We do not
need the multipliers corresponding to (39) since, owing to (36), they do not
contribute to the sum $\sum_k \beta_k F_k$ in (40). Eq. (40) therefore becomes

$$\begin{aligned}
\Delta &= \lambda + \frac{1}{2} \sum_{|i-j| \le M} r_x(|i-j|) \bigl((\mathbf{R}_2^{-1})_{ij} - \delta_{ij}\bigr) \\
&= \lambda + \frac{1}{2} \sum_{|i-j| \le M} r_x(|i-j|)\, (\mathbf{R}_2^{-1})_{ij} - \frac{k}{2}\, r_x(0) .
\end{aligned} \tag{42}$$

The last term cannot affect the minimization since it doesn't depend on the
codebook entry; we therefore drop it. From (41), we have

$$\begin{aligned}
\lambda &= \tfrac{1}{2}\, \mathbf{m}^t \mathbf{R}_2^{-1} \mathbf{m} + \tfrac{1}{2} \log \det \mathbf{R}_2 \\
&= \tfrac{1}{2}\, \mathbf{m}^t \mathbf{R}_2^{-1} \mathbf{m} + \tfrac{1}{2} \log \det \bigl(\sigma^2 (\mathbf{A}^t \mathbf{A})^{-1}\bigr) \\
&= \tfrac{1}{2}\, \mathbf{m}^t \mathbf{R}_2^{-1} \mathbf{m} + \tfrac{k}{2} \log \sigma^2 - \tfrac{1}{2} \log \det \mathbf{A} - \tfrac{1}{2} \log \det \mathbf{A}^t \\
&= \tfrac{1}{2}\, \mathbf{m}^t \mathbf{R}_2^{-1} \mathbf{m} + \tfrac{k}{2} \log \sigma^2 ,
\end{aligned}$$

where we have used (34) and, in the last step, the fact that $\mathbf{A}$ is
triangular with $A_{ii} = a_0 = 1$ from (30) and (27). Since minimizing
$\Delta$ in (42) is equivalent to minimizing $2\Delta/k$, it follows that the
best codeword can be found by minimizing

$$\frac{1}{k}\, \mathbf{m}^t \mathbf{R}_2^{-1} \mathbf{m} + \log(\sigma^2) + \frac{1}{k} \sum_{|i-j| \le M} r_x(|i-j|)\, (\mathbf{R}_2^{-1})_{ij} . \tag{43}$$

Now, the first term in (43) evaluates to
$(\mathbf{v}^t \mathbf{v})/(k\sigma^2)$ by means of (33)-(34). Since
$\mathbf{v}$ has a fixed number of at most $M+1$ non-zero terms (see (31)),
it follows that this term goes to zero as $k \to \infty$.

Expanding (34) by means of (30) leads to

$$(\mathbf{R}_2^{-1})_{ij} = \frac{1}{\sigma^2} \sum_{n=\max(i,j)}^{\min(i+M,\,j+M,\,k-1)} a_{n-i}\, a_{n-j} . \tag{44}$$

Since $\mathbf{R}_2$ is symmetric, it suffices to consider the case
$i \ge j$, for which (44) becomes

$$(\mathbf{R}_2^{-1})_{ij} = \frac{1}{\sigma^2} \sum_{n=i}^{\min(M+j,\,k-1)} a_{n-i}\, a_{n-j} = \frac{1}{\sigma^2} \sum_{s=0}^{\min(M+j,\,k-1)-i} a_s\, a_{s+|i-j|} .$$

Provided that

$$\min(i,j) < k-1-M \tag{45}$$

holds,

$$(\mathbf{R}_2^{-1})_{ij} = \frac{1}{\sigma^2}\, r_a(|i-j|) , \quad |i-j| \le M ,$$

follows from (29). Equation (43) is therefore equivalent to

$$\log(\sigma^2) + \frac{1}{k\sigma^2} \sum_{|i-j| \le M} r_x(|i-j|)\, r_a(|i-j|) = \log(\sigma^2) + \frac{1}{\sigma^2} \Bigl\{ r_x(0)\, r_a(0) + 2 \sum_{s=1}^{M} r_x(s)\, r_a(s) \Bigr\} , \tag{46}$$

which is the same as (28). In deriving the last term of (46), we have ignored
corrections necessary for proper evaluation of the sum at the matrix
boundaries. However, these corrections involve only $(M+1)^2$ terms (see
(45)) and therefore become negligible as $k \to \infty$.


VI.  GENERAL  DISCUSSION 

The special properties of cross-entropy that hold for minimum
cross-entropy densities [11] result in a pattern classification method with
several advantages: It is optimal in a well-defined, information-theoretic
sense; it is computationally attractive; and it includes a self-consistent,
simple method of computing the set of cluster centroids in terms of which the
classification is made. A special case of this method (speech coding by
vector quantization) has already proved to be successful. It therefore seems
likely that the method can be used successfully in a variety of other
applications.